feat(prepro): assign segment/subtype using nextclade sort (3/n) by anna-parker · Pull Request #5402 · loculus-project/loculus

anna-parker · 2025-11-10T15:48:06Z

resolves #4847

Screenshot

Improves #4821, comes after #5398

You can use pathoplexus/dev_example_data#2 for testing.

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype: entries must have the format `<submissionId>_<segmentName>` (as in the current setup).
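
A minimal sketch of the header-based fallback, assuming the segment name is the part after the last underscore of the fasta header. The function name `split_fasta_header` and the `valid_segments` argument are hypothetical and are not taken from the preprocessing code:

```python
def split_fasta_header(header: str, valid_segments: set[str]) -> tuple[str, str]:
    """Split a header of the form <submissionId>_<segmentName>.

    Assumption: the segment name never contains an underscore, so we split
    on the LAST underscore; the submissionId itself may contain underscores.
    """
    submission_id, sep, segment = header.rpartition("_")
    if not sep or not submission_id or segment not in valid_segments:
        raise ValueError(
            f"header {header!r} is not of the form <submissionId>_<segmentName>"
        )
    return submission_id, segment
```

Splitting on the last underscore keeps submission IDs such as `sample_1` intact; a header without a known segment suffix is rejected rather than guessed.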

As preprocessing now assigns segments, it returns a map from the segment (or subtype) to the fastaHeader in the processedData, called sequenceNameToFastaHeaderMap. This allows us to surface the assignment on the edit page.
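
As an illustration of the returned map, here is a rough sketch that inverts per-sequence assignments (fastaHeader to segment) into a segment to fastaHeader map. The function name and the handling of duplicates are assumptions for illustration only; the actual preprocessing code may treat conflicts differently:

```python
from typing import Optional


def build_sequence_name_to_fasta_header_map(
    assignments: dict[str, Optional[str]],
) -> tuple[dict[str, str], list[str]]:
    """Invert {fastaHeader: segment} into {segment: fastaHeader}.

    Returns the map plus a list of segments that were assigned more than
    once, so the caller can report them as errors (hypothetical behaviour).
    """
    mapping: dict[str, str] = {}
    duplicates: list[str] = []
    for fasta_header, segment in assignments.items():
        if segment is None:
            continue  # sequence could not be assigned to any segment
        if segment in mapping:
            duplicates.append(segment)  # keep the first header, flag the clash
            continue
        mapping[segment] = fasta_header
    return mapping, duplicates
```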

Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers, we make nucleotideSequences a list of sequences. The first block below shows the old config, the second block the new one:

nextclade_dataset_name: 
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]

nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output

Note the templates now also generate the genes list from the merged config.
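
To make the fallback rules of the new list format concrete, here is a hedged sketch that resolves each entry against the top-level defaults. It works on an already-parsed config dict; `SequenceConfig`, `resolve_sequences` and `prefixed_gene` are hypothetical names, and joining prefix and gene with an underscore is an assumption based on the `EV1` / `EV1_AV1` example:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SequenceConfig:
    name: str
    dataset_name: str
    dataset_server: Optional[str]
    dataset_tag: Optional[str]
    accepted_sort_matches: list[str]
    gene_prefix: Optional[str]


def resolve_sequences(config: dict) -> list[SequenceConfig]:
    """Apply per-sequence overrides on top of organism-level defaults."""
    default_server = config.get("nextclade_dataset_server")
    resolved = []
    for entry in config["nucleotideSequences"]:
        dataset_name = entry["nextclade_dataset_name"]
        resolved.append(
            SequenceConfig(
                name=entry["name"],
                dataset_name=dataset_name,
                # a per-sequence server overrides the top-level one
                dataset_server=entry.get("nextclade_dataset_server", default_server),
                dataset_tag=entry.get("nextclade_dataset_tag"),
                # if not given, nextclade_dataset_name is used
                accepted_sort_matches=entry.get("accepted_sort_matches", [dataset_name]),
                gene_prefix=entry.get("gene_prefix"),
            )
        )
    return resolved


def prefixed_gene(seq: SequenceConfig, gene: str) -> str:
    """Assumed convention: '<gene_prefix>_<gene>' when a prefix is set."""
    return f"{seq.gene_prefix}_{gene}" if seq.gene_prefix else gene
```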

PR Checklist

Update values.schema.json
keep tests for alignment NONE case
Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
Any manual testing that has been done is documented: EVs from the test folder were submitted with the fastaHeader equal to the submissionId (this succeeded); submission of CCHF with a fastaID column in the metadata was tested (also from the folder above); revision of a segment was tested
Have preprocessing send back a segment: fastaHeader mapping

Future Work

add integration testing for full EV submission user journey
improve CCHF minimizer (some segments are again not assigned)
discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
update PPX docs with new multi-segment submission format

🚀 Preview: https://sort-multi-path.loculus.org

anna-parker · 2025-11-18T09:09:19Z

kubernetes/loculus/values.yaml

          <<: *preprocessingConfigFile
-          log_level: INFO
-          nextclade_dataset_name: community/hodcroftlab/enterovirus/enterovirus/linked
+          minimizer_index: "https://raw.githubusercontent.com/alejandra-gonzalezsanchez/loculus-evs/master/evs_minimizer-index.json"


@corneliusroemer idea: add a flag that we are multipath and that the genes need to have the subtype as a prefix

preprocessing/nextclade/src/loculus_preprocessing/prepro.py

resolves #

to be merged into #5402 when it works

### Screenshot

As prepro now assigns the segments the website does not know which entry in the original data corresponds to which segment - this updates prepro to also return a map with this information.

### PR Checklist

- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated tests.
- [ ] Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://improve-endpoints.loculus.org

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

website/src/components/Edit/SequencesForm.tsx

backend/src/main/kotlin/org/loculus/backend/api/SubmissionTypes.kt

resolves #

When submitting via the single page submission users now need to submit with the fastaHeader in the format `submissionId_segment` - i.e. the same requirements we have for the multi segment submission.

- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated tests.
- [ ] Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: https://edit-page-anya-2.loculus.org

… refactor multi segment submission in backend and edit page and have prepro assign segments (#5382)

resolves #4999 #4708, #4734, #5511

partially resolves #5392, #5185 (comment)

includes work done in #5398 and #5402

This PR additionally fixes submission, subtype assignment and search for EVs and other multi-path organisms.

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are now required to add an additional `fastaIds` column with a space-separated list of the `fastaId`s (fasta header IDs) of the respective sequences. If no `fastaIds` column is supplied the `submissionId` will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings)

Nextclade sort (uses a minimizer index for fast local alignment) or nextclade align (full sequence alignment to reference) will be used to assign segments/subtypes for all multi-segmented and multi-pathogen sequences (this is also done in ingest for grouping segments):

```
segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
```

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format `<submissionId>_<segmentName>` (as in current set up).

As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaId in the processedData, the map is called: `sequenceNameToFastaId`. This allows us to surface the segment assignment on the edit page.

### Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we make `nucleotideSequences` a dictionary where each item includes all information required to run nextclade. I.e. we change from:

```
nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

to:

```
nextclade_sequence_and_datasets:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
```

### Ingest Pipeline Config changes

`minimizer_index` is changed to `minimizer_url` for consistency (can be used in ingest and preprocessing and should both be the same)

### Optional additional Config changes

Limit the number of sequences the backend will accept per submission by using - should be added for multi-segmented organisms:

```
submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
```

### Testing

You can use pathoplexus/example_data#16 and pathoplexus/dev_example_data#2 for testing.

### PR Checklist

- [x] Update values.schema.json and other READMEs
- [x] add fastaId to commonMetadata (ensure it is downloaded in templates): #5561
- [x] Fix how genes are returned (will cause a config update): #5563
- [x] Improve prepro code (less duplication and more tests): #5554
- [x] ingest EVs as single segmented to ensure search works: #5511
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping
- ~add integration testing for full EV submission user journey~ -> will be done in a later PR
- [x] improve CCHF minimizer (some segments are again not assigned)
- [x] discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key) -> decided against
- [x] update PPX docs with new multi-segment submission format -> test PR here: pathoplexus/pathoplexus#759
- [x] update example data for demo

🚀 Preview: https://edit-page-anya.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>
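
The `fastaIds` rule from the breaking changes above can be sketched roughly as follows. This is only an illustration of the described behaviour, not the backend implementation (which is written in Kotlin); the function name `fasta_ids_for_entry` is hypothetical and a metadata row is assumed to be a plain dict of column name to string value:

```python
def fasta_ids_for_entry(metadata_row: dict[str, str]) -> list[str]:
    """Return the fasta header IDs grouped under one metadata entry.

    If the `fastaIds` column is present and non-empty, it is read as a
    space-separated list. Otherwise the `submissionId` is used, i.e. a
    one-to-one mapping as in the single-segmented case.
    """
    raw = (metadata_row.get("fastaIds") or "").strip()
    if raw:
        return raw.split()  # split() also tolerates repeated spaces
    return [metadata_row["submissionId"]]
```

For example, a row with `submissionId` `sample1` and `fastaIds` `seqA seqB seqC` groups three sequences under one entry, while a row without `fastaIds` maps to the single fasta ID `sample1`.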

anna-parker changed the title ~~feat(prepro): start assigning segment using nextclade sort~~ feat(prepro): assign segment using nextclade sort Nov 10, 2025

anna-parker mentioned this pull request Nov 10, 2025

feat!(prepro, config): assign segment with nextclade sort #4783

Closed

3 tasks

anna-parker added the preview Triggers a deployment to argocd label Nov 10, 2025

anna-parker force-pushed the sort-multi-path branch from 4966265 to 95813b2 Compare November 11, 2025 08:54

anna-parker force-pushed the multi-segment-submission-2 branch from ffac38a to 87fbe02 Compare November 11, 2025 08:57

anna-parker commented Nov 17, 2025

kubernetes/loculus/values.yaml (outdated)

anna-parker mentioned this pull request Nov 17, 2025

feat!(backend): refactor multi-segment submission (2/n) #5398

Merged

9 tasks

anna-parker commented Nov 18, 2025

preprocessing/nextclade/src/loculus_preprocessing/prepro.py (outdated)

anna-parker mentioned this pull request Nov 19, 2025

update processed data endpoint #5460

Merged

3 tasks

anna-parker changed the title ~~feat(prepro): assign segment using nextclade sort~~ feat(prepro): assign segment/subtype using nextclade sort Nov 19, 2025

anna-parker force-pushed the sort-multi-path branch 2 times, most recently from 77f4b06 to 5e0cd35 Compare November 19, 2025 17:04

anna-parker marked this pull request as ready for review November 19, 2025 18:56

chatgpt-codex-connector bot reviewed Nov 19, 2025

website/src/components/Edit/SequencesForm.tsx

anna-parker force-pushed the sort-multi-path branch from fbb92f0 to 8ecc3af Compare November 19, 2025 20:30

corneliusroemer changed the title ~~feat(prepro): assign segment/subtype using nextclade sort~~ feat(prepro): assign segment/subtype using nextclade sort (3/n) Nov 20, 2025

anna-parker force-pushed the multi-segment-submission-2 branch from 87fbe02 to a61ec9b Compare November 20, 2025 09:54

anna-parker mentioned this pull request Nov 20, 2025

feat!(website, prepro, backend, config, integration):multi pathogen - refactor multi segment submission in backend and edit page and have prepro assign segments #5382

Merged

13 tasks

anna-parker force-pushed the sort-multi-path branch 2 times, most recently from 6f073fc to b3cd475 Compare November 20, 2025 10:40

corneliusroemer reviewed Nov 20, 2025

backend/src/main/kotlin/org/loculus/backend/api/SubmissionTypes.kt

anna-parker added 7 commits November 20, 2025 12:26

feat!(backend): use column fastaId in metadata to group multi-segment…

21ae970

…ed organisms

feat(backend): fix merge conflicts

13be740

feat(prepro): start assigning segment using nextclade sort

1b31cb6

fix submission helper on website

97adcd3

improve minimizer for CCHF

bf6c9c7

feat: add back tests for None case

861e881

anna-parker added 7 commits November 20, 2025 12:26

feat: rename config value minimizer_url to minimizer_index (to be the…

aa2f2d7

… same as in ingest)

feat: update integration tests

c92074a

feat(prepro): add map from segmentName to fastaHeader

b48cdab

feat: update backend endpoint

02975e9

update edit page

371750e

merge conflict

98e2482

hmmmm

c9b5f6c

anna-parker force-pushed the sort-multi-path branch from ab2b5db to c9b5f6c Compare November 20, 2025 11:26

formatting

ce0ce37

anna-parker merged commit c9afc61 into multi-segment-submission-2 Nov 20, 2025
40 of 42 checks passed

anna-parker deleted the sort-multi-path branch November 20, 2025 12:23

anna-parker merged 15 commits into multi-segment-submission-2 from sort-multi-path
